This report explores a data set of 1599 red wines. Various exploration methods are used to analyse which chemical properties influence the quality of red wines.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The data set contains 1599 red wines and thirteen variables. Variable “X” is only a row index, so it is removed from the dataset, leaving twelve variables for analysis.
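Dropping the index column can be sketched as follows. This is a minimal sketch, assuming the data frame is called `pf` (the name that appears in the model calls later in this report). Only the first three rows of a few columns from the structure listing above are reproduced here.

```r
# First three rows of a few columns, copied from the str() output above.
# The real data frame has 1599 rows and 13 columns.
pf <- data.frame(
  X             = 1:3,
  fixed.acidity = c(7.4, 7.8, 7.8),
  alcohol       = c(9.4, 9.8, 9.8),
  quality       = c(5L, 5L, 5L)
)

# X only repeats the row number, so drop it before analysis.
pf$X <- NULL

names(pf)
```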
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
The best red wines in the data are rated eight, and the worst receive a rating of three. The graph shows that red wine quality is approximately normally distributed, with peaks at ratings five and six.
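The summary and frequency table above can be reproduced without the original file by rebuilding the quality column from the printed counts. This is only a sketch: the names `rating_counts` and `share_mid` are illustrative and do not come from the original analysis.

```r
# Counts per quality rating, copied from the frequency table above.
rating_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Rebuild the quality column from the counts.
quality <- rep(as.integer(names(rating_counts)), times = rating_counts)

summary(quality)   # minimum, quartiles, mean and maximum
table(quality)     # frequency of each rating

# Share of wines rated 5 or 6.
share_mid <- mean(quality %in% c(5, 6))
```

The share of wines rated 5 or 6 shows how concentrated the ratings are in the middle of the scale.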
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The alcohol level is right-skewed, with a mean of 10.42% and a median of 10.20%.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
There are no basic wines, since the maximum pH is 4.01. The pH is approximately normally distributed, with a mean of 3.311 and a median of 3.310.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Both fixed acidity and volatile acidity are slightly right-skewed. The plot for fixed acidity peaks around 7. There are a few outliers in the volatile acidity plot. It peaks around 0.6 and the majority of the observations fall between 0.3 and 0.6. Additionally, all red wines contain only a small amount of citric acid (at most 1.0), with one peak at zero and another around 0.49.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Total sulfur dioxide is right-skewed, with a relatively long tail to the right. To obtain a better understanding of the pattern, I used a log10 scale on the x-axis. There is a peak around 45, and on the log scale the distribution is close to normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Similar to total sulfur dioxide, free sulfur dioxide is right-skewed. A log scale is used on the x-axis. There is a peak around 6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Sulphates are also right-skewed, and there are a few outliers greater than 1.5.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
Density has a fairly small range overall. This is expected, as wines have a density close to that of water. The distribution of density is approximately normal, with a mean of 0.9967 and a median of 0.9968.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The distribution of residual sugar is right-skewed. The graph is transformed using a log scale on the x-axis. The majority of red wines have between about 1.6 g/dm^3 and 2.8 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Once again, the chlorides plot is right-skewed. I transformed the long-tailed data to better understand the distribution of chlorides. The distribution is close to normal if observations above 0.2 are excluded.
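One quick way to check the direction of skew without the plots is to compare each mean with its median, using the summary values printed above. This is a rough heuristic rather than a formal test, and the column name `rel_gap` is an illustrative choice.

```r
# Mean and median per variable, copied from the summary() output above.
stats <- data.frame(
  variable = c("alcohol", "total.sulfur.dioxide", "free.sulfur.dioxide",
               "sulphates", "residual.sugar", "chlorides", "pH"),
  median   = c(10.20, 38.00, 14.00, 0.6200, 2.200, 0.07900, 3.310),
  mean     = c(10.42, 46.47, 15.87, 0.6581, 2.539, 0.08747, 3.311)
)

# Relative gap between mean and median: a rough, scale-free hint of skew.
# Positive values mean the mean is pulled above the median (right tail).
stats$rel_gap <- (stats$mean - stats$median) / stats$median

# Largest gap first.
stats[order(-stats$rel_gap), ]
```

A gap close to zero, as for pH, is consistent with a roughly symmetric distribution. The larger gaps belong to the variables plotted on a log scale.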
There are 1599 observations within the data set. There are 13 variables: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free SO2, total SO2, density, pH, sulphates, alcohol, quality and X (excluded from analysis). All the variables are numerical.
The main feature of interest in the dataset is quality. I would like to determine which chemical properties influence the quality of red wines.
I would like to include residual sugar, chlorides, the various types of acidity, levels of different sulfur dioxides and density in the investigation.
No new variables have been created so far.
I noted that a few variables (residual sugar, chlorides, total sulfur dioxide and free sulfur dioxide) have long-tailed data. I transformed them using a log scale on the x-axis, which made the pattern and distribution easier to understand. For instance, total sulfur dioxide has a right-skewed plot. After the transformation, its distribution is close to normal with a peak around 45.
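The effect of the log10 transformation can be illustrated on synthetic data. This is a sketch under stated assumptions: the log-normal sample below only stands in for a long-tailed variable such as total sulfur dioxide, and the `skewness` helper is defined here for illustration rather than taken from the original analysis.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Synthetic long-tailed sample standing in for total sulfur dioxide.
# The log-normal shape and its parameters are assumptions, not fitted values.
x <- rlnorm(1599, meanlog = log(38), sdlog = 0.7)

# Moment-based sample skewness: close to 0 for a symmetric distribution.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)         # positive: long right tail
skew_log <- skewness(log10(x))  # much closer to zero after the transform

# Histogram counts on the transformed values (drop plot = FALSE to draw it).
# In ggplot2, adding scale_x_log10() to a plot has a comparable effect.
h <- hist(log10(x), breaks = 30, plot = FALSE)
```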
In this section, I will find out the correlation between different variables, especially between quality and any other supporting variables. Then, I will pick a few variables to perform further analysis.
The correlation plot above shows the correlation between variables. Because the dataset is fairly small, all the data are included in the plot.
The variables with a relatively strong correlation with red wine quality are alcohol (0.48) and volatile acidity (-0.39). I will investigate these two variables further within this section.
Similar to the plot above, this plot presents the correlation between variables as a correlation matrix. The larger the circle, the stronger the correlation. Blue means positive correlation and red indicates negative correlation. The rectangles around the correlation matrix are drawn based on the results of hierarchical clustering.
Apart from the correlation between supporting variables and red wine quality, the following correlations are interesting:
alcohol vs. density
fixed.acidity vs. density
residual.sugar vs. density
chlorides vs. sulphates
free sulfur dioxide vs. total sulfur dioxide
pH vs. fixed acidity
pH vs. citric acid
The following relatively strong correlations are within expectation:
free sulfur dioxide vs. total sulfur dioxide - These two variables have a strong positive correlation of 0.67. The plot above illustrates the positive correlation, and the standard error around the trend line is relatively small. Since free sulfur dioxide is part of total sulfur dioxide, it is expected that these two variables are correlated.
pH vs. fixed acidity - The above plot shows that fixed acidity and pH are negatively correlated (-0.68). This is expected, as an increase in fixed acidity in red wine brings down the overall pH of the wine.
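The mechanics of computing such a correlation can be sketched on the ten observations shown in the structure listing at the top of this report. Ten rows cannot reproduce the full-data value of -0.68. The data frame name `excerpt` is illustrative.

```r
# First ten observations of two columns, copied from the str() output
# at the top of this report.
excerpt <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Pearson correlation between the two columns.
r <- cor(excerpt$fixed.acidity, excerpt$pH)

# Full correlation matrix, the input a correlation plot is drawn from.
cor_matrix <- cor(excerpt)
```

Even on this small excerpt the sign is negative, in line with the chemistry described above.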
A new variable, quality_factor, is created as a factor. I used “geom_jitter” in the scatter plot to reduce the overlap between observations. Both the scatter plot and the box plot show that a higher alcohol level is associated with higher red wine quality. There is one exception: the median alcohol level for rating 5 is lower than the median for rating 4.
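A minimal sketch of this step is shown below, using only the ten observations from the structure listing at the top of the report. The data frame name `pf` follows the model calls later in the report. The plot layers and their `width` and `alpha` settings are assumptions, not the original plotting code.

```r
# First ten observations, copied from the str() output at the top of the report.
pf <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Treat the rating as a category so each rating gets its own box.
pf$quality_factor <- factor(pf$quality)

# Median alcohol per rating: the value drawn as the centre line of each box.
median_by_rating <- tapply(pf$alcohol, pf$quality_factor, median)

# Jittered points with a boxplot on top; skipped if ggplot2 is not installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(pf, ggplot2::aes(x = quality_factor, y = alcohol)) +
    ggplot2::geom_jitter(width = 0.2, alpha = 0.5) +
    ggplot2::geom_boxplot(alpha = 0.3)
}
```

Comparing `median_by_rating` across ratings is how an exception like the one at rating 5 can be confirmed numerically on the full data.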
Both the scatter plot and the box plot show that higher-quality red wines are associated with lower levels of volatile acidity. The boxplot shows a clearer pattern, with the median volatile acidity decreasing as quality increases.
A large proportion of observations have 9% to 11% alcohol. Alcohol and density are negatively correlated.
The trend within the plot is clear: as fixed acidity increases, density also increases.
There are a few outliers with a large amount of residual sugar, so the top 1% of observations is excluded from the plot. Although there is an upward trend within the plot, the correlation between residual sugar and density is not very strong. The majority of the observations have less than 4 g/dm^3.
The top plot above excludes the top 1% of chlorides to remove outliers from the sample. Similarly, there is an upward trend in the plot, but the correlation between chlorides and sulphates does not appear to be strong. Most of the observations have between 0.05 g/dm^3 and 0.12 g/dm^3 of chlorides.
The second plot includes only observations with chlorides below 0.15 g/dm^3. There does not appear to be a strong correlation between chlorides and sulphates.
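Excluding the top 1% of a variable, as done for residual sugar and chlorides, can be sketched with `quantile()`. The sample below is synthetic and only stands in for residual sugar. Its distribution and the names `cutoff` and `trimmed` are assumptions for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Synthetic stand-in for residual sugar; the distribution is an assumption.
residual_sugar <- rlnorm(1599, meanlog = log(2.2), sdlog = 0.35)

# 99th percentile: values above it form the "top 1%".
cutoff <- quantile(residual_sugar, probs = 0.99)

# Keep only observations at or below the cutoff.
trimmed <- residual_sugar[residual_sugar <= cutoff]

# Number of observations dropped.
length(residual_sugar) - length(trimmed)
```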
In this section, I have plotted the correlation between variables using two different visualisations. The two variables with a relatively strong correlation with red wine quality are alcohol and volatile acidity. Alcohol is positively correlated with red wine quality, whereas an increase in volatile acidity is associated with a decrease in red wine quality.
I also noted that alcohol and density are negatively correlated, while fixed acidity and density are positively correlated.
The strongest relationship I found is between fixed acidity and pH which has a correlation of -0.68.
In the section above, I chose alcohol and volatile acidity for further analysis due to their relatively high correlation with red wine quality. In the plots above, all three variables are brought together. Green represents good quality wines (7 and 8), yellow represents mid-range wines (5 and 6) and red represents poor quality wines (3 and 4). Good quality wines appear to have volatile acidity between 0.2 and 0.5 and alcohol between 11 and 13. Interestingly, wines with rating 5 sit at the bottom left of the plot. They have volatile acidity between 0.4 and 0.8 and an alcohol level below 10.
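The three colour groups can be derived from the rating with `cut()`. The sketch below rebuilds the quality column from the frequency table earlier in this report. The variable name `quality_group` and the labels are illustrative, not taken from the original code.

```r
# Counts per quality rating, copied from the frequency table earlier in the report.
rating_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(rating_counts)), times = rating_counts)

# Intervals are right-closed: (2,4] -> poor, (4,6] -> mid, (6,8] -> good.
quality_group <- cut(quality, breaks = c(2, 4, 6, 8),
                     labels = c("poor", "mid", "good"))

# Number of wines in each group.
group_sizes <- table(quality_group)
group_sizes
```

The group sizes make the imbalance visible: the mid-range group contains most of the wines, so the good and poor groups rest on far fewer observations.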
##
## Calls:
## m1: lm(formula = quality ~ alcohol + volatile.acidity, data = pf)
## m2: lm(formula = quality ~ alcohol + volatile.acidity + density,
## data = pf)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + density +
## fixed.acidity, data = pf)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + density +
## fixed.acidity + residual.sugar, data = pf)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + density +
## fixed.acidity + residual.sugar + chlorides, data = pf)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + density +
## fixed.acidity + residual.sugar + chlorides + sulphates, data = pf)
##
## ===========================================================================================
## m1 m2 m3 m4 m5 m6
## -------------------------------------------------------------------------------------------
## (Intercept) 3.095*** -18.407 15.573 13.785 12.996 41.848*
## (0.184) (10.298) (15.187) (17.733) (17.743) (17.797)
## alcohol 0.314*** 0.333*** 0.311*** 0.312*** 0.309*** 0.265***
## (0.016) (0.018) (0.020) (0.021) (0.022) (0.022)
## volatile.acidity -1.384*** -1.365*** -1.272*** -1.273*** -1.269*** -1.046***
## (0.095) (0.096) (0.100) (0.100) (0.100) (0.103)
## density 21.360* -12.922 -11.129 -10.271 -39.487*
## (10.228) (15.214) (17.770) (17.782) (17.843)
## fixed.acidity 0.045** 0.044** 0.045** 0.056***
## (0.015) (0.016) (0.016) (0.016)
## residual.sugar -0.003 -0.002 0.013
## (0.014) (0.014) (0.014)
## chlorides -0.436 -1.729***
## (0.365) (0.394)
## sulphates 0.891***
## (0.113)
## -------------------------------------------------------------------------------------------
## R-squared 0.3 0.3 0.3 0.3 0.3 0.3
## adj. R-squared 0.3 0.3 0.3 0.3 0.3 0.3
## sigma 0.7 0.7 0.7 0.7 0.7 0.7
## F 370.4 248.9 189.9 151.9 126.8 121.7
## p 0.0 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -1621.8 -1619.6 -1615.0 -1615.0 -1614.3 -1583.8
## Deviance 711.8 709.9 705.8 705.8 705.1 678.8
## AIC 3251.6 3249.3 3242.0 3244.0 3244.6 3185.6
## BIC 3273.1 3276.1 3274.3 3281.6 3287.6 3234.0
## N 1599 1599 1599 1599 1599 1599
## ===========================================================================================
Based on the plots in this section and the previous section, I created linear models for quality that include the variables investigated so far: alcohol, volatile acidity, density, fixed acidity, residual sugar, chlorides and sulphates. The variables in these models account for only about 30% of the variance in red wine quality. Adding variables barely improves the R-squared value, which stays at about 0.3 across all six models. Hence the linear relationship between quality and the independent variables is weak.
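Because the table rounds R-squared to one decimal place, nested models are easier to compare with more digits. The sketch below shows one way to do that on synthetic data: the predictors are drawn uniformly over the ranges reported in the summary (an assumption), and the response is generated from the m1 coefficients and sigma in the table. The names `sim`, `m0` and `fit` are illustrative, and the resulting R-squared values are not those of the real data.

```r
set.seed(123)  # arbitrary seed so the sketch is reproducible
n <- 1599

# Synthetic predictors drawn uniformly over the reported ranges;
# the uniform shape is an assumption.
sim <- data.frame(
  alcohol          = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58)
)

# Synthetic response built from the m1 coefficients and sigma in the table.
sim$quality <- 3.095 + 0.314 * sim$alcohol -
  1.384 * sim$volatile.acidity + rnorm(n, sd = 0.7)

m0 <- lm(quality ~ alcohol, data = sim)
m1 <- update(m0, . ~ . + volatile.acidity)  # add one predictor

# Compare fit with more digits than a one-decimal table shows.
fit <- data.frame(
  model  = c("m0", "m1"),
  adj_r2 = c(summary(m0)$adj.r.squared, summary(m1)$adj.r.squared),
  aic    = c(AIC(m0), AIC(m1))
)
fit
```

On the real data, the same comparison applied to m1 through m6 would show how much each added variable contributes. The AIC row of the table already suggests that the largest change comes with m6.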
No. As the results of the linear regression models show, there do not appear to be features that strengthen each other.
From the plot, good quality wines seem to have a relatively high percentage of alcohol and a relatively low amount of volatile acidity.
A linear model was built for the dataset. It is a simple and easily interpretable model. However, due to the relatively weak linear correlation between variables, this might not be the best model to be used for this particular dataset.
The distribution of red wine quality is approximately normal and peaks at 5 and 6.
Alcohol and volatile acidity are the two variables with a relatively strong correlation with red wine quality. The two boxplots above show clear trends: alcohol is positively correlated with red wine quality, whereas volatile acidity is negatively correlated with it.
This plot groups all quality ratings by both alcohol and volatile acidity. Higher-quality wines (rating above 5) are associated with a higher percentage of alcohol and lower volatile acidity. Although the overall trend shows red wine quality increasing as the percentage of alcohol increases, there is a dip at rating 5, where the median alcohol level is lower than for rating 4.
This data set contains information on 1,599 different red wines. The purpose of the analysis was to determine which chemical properties affect the quality of wine. I started my analysis by understanding each variable. Then, I explored the correlations between the variables. None of the correlations were above 0.7. Among all variables, alcohol and volatile acidity had a relatively strong correlation with quality. Those two variables were picked for further analysis. Additionally, a few other pairs of variables were picked for analysis, such as those involving density. Finally, both alcohol and volatile acidity were plotted against quality in the multivariate analysis, and a linear model was built. Based on the results from the model, it is difficult to draw a definitive conclusion about which variables significantly affect red wine quality.
One limitation of this dataset is that the sample size is relatively small. Although there are more than 600 observations each for ratings 5 and 6, there are fewer than 70 observations in total for ratings 3 and 4 and only 18 observations for rating 8. Any outlier within ratings 3, 4 or 8 will inevitably affect the overall trend. Another suggestion is to include more categorical variables in the dataset, such as production location, which is an important factor for red wines.